Producing charts for ‘The Changing Fortunes of the Richest Countries in Grand Slam Tennis’

By: Dr. Chris Martin

Tools / packages used: R, R Markdown, ggplot2, tidyverse (inc. dplyr and tidyr), plotly.

Techniques used: exploratory data analysis, functional programming (purrr package), data visualisation, data cleaning/reshaping/manipulation.

Chart types used: area chart, line chart, bar chart, stacked bar chart, small multiples, heatmap, ridge chart, interactive charts (with ggplotly).

Source data: To produce the charts, I needed data on the women’s and men’s singles entrants for each Grand Slam tournament since 1990. This came from the excellent Tennis Abstract.

This notebook produces the static data visualisations which features in my data storytelling project: The Changing Fortunes of the Richest Countries in Grand Slam Tennis. You can read the full story on my website.

A note on my data visualistion workflow

The chart produced in this notebook are ‘skeletons’ with fairly minimal styling, but all the key structural components in places. The chart are exported from this notebook as svgs. These can are then editted - adding textures, photos, annotations etc. - using graphic design software to create the final versions.

Setting up the notebook

# import packages
library(tidyverse)  # for data manipulation and viz
library(knitr)      # for formatting tables

# set default theme for exploratory plots
theme_set(theme_light())  # using a minimal theme to make it easier to edit 
                          # the plots in graphic design software later on

# set default R markdown chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Reading in the data

The data is read in from csvs produced in data_clean.Rmd, a lot of the data preparation was done in that notebook.

gs_first_round_gdp <- read_csv("../data/results_gdp.csv") %>% 
  #just look up until covid pandemic  (as will have distorted things)
  filter(year < 2020)

# check data looks as expected
gs_first_round_gdp %>% 
  head() %>% 
  kable()
year tourney_name tour name id ioc country gdp_per_capita iso
1990 Australian Open atp Jim Pugh 101004 USA United States 40436.94 USA
1990 Australian Open atp Ivan Lendl 100656 USA United States 40436.94 USA
1990 Australian Open atp Cyril Suk 101327 CZE Czechia 23585.18 CZE
1990 Australian Open atp Tomas Carbonell 101507 ESP Spain 27543.92 ESP
1990 Australian Open atp Michael Brown B395 101895 AUS Australia 31016.42 AUS
1990 Australian Open atp Karel Novacek 101120 CZE Czechia 23585.18 CZE
gs_entries_by_country <- read_csv("../data/gs_entries_by_country.csv") %>% 
  # just look up until covid pandemic  (as will have distorted things)
  filter(year < 2020)

# check data looks as expected
gs_entries_by_country %>% 
  head() %>% 
  kable()
year tourney_name country_code country gdp_per_capita num_first_rd income_decile top_20_perc
1990 Australian Open USA United States 40436.94 55 10 TRUE
1990 Australian Open AUS Australia 31016.42 36 9 TRUE
1990 Australian Open FRA France 33732.02 20 9 TRUE
1990 Australian Open GER Germany 36699.48 20 9 TRUE
1990 Australian Open SWE Sweden 34156.82 14 9 TRUE
1990 Australian Open CZE Czechia 23585.18 10 8 FALSE

How have the richest 20% of countries performed at Grand Slams (1990 - 2019)?

This section focuses on how well the richest 20% of countries performed at Grand Slams (1990 - 2019). The metric used for a country’s performance is how many players they had appearing in the first round of Grand Slam tennis tournaments.

The overall trend

The performance of the richest countries declined 1900 to 2008, and then picked up again a little.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
plot_df <- gs_entries_by_country %>% 
  
  # looked performance for two groups of countries
  # top 20% richest countries and the rest
  mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc)) %>% 
  group_by(year, top_20_perc) %>% 
  summarise(num_first_rd = sum(num_first_rd)) %>% 
  ungroup() %>%
  
  # calculate proportions from counts
  group_by(year) %>% 
  mutate(perc_first_round = num_first_rd / sum(num_first_rd)) %>% 
  ungroup()

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------  
p <-  ggplot(plot_df) +
  
  # core chart
  geom_area(aes(year, perc_first_round, fill = top_20_perc)) +
  
  # format axis
    scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
                       expand = c(0,0)) +
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
  
  # tidy up presentation
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.margin = margin(25,25,25,25)) 

p 

# ----------------------------------------------------------------------------
# Export the plot for editing
# ----------------------------------------------------------------------------
ggsave("../images/all_gs.svg", units = "mm", width = 600, height = 325)

Is the overall trend driven by the composition of 20% richest countries changing over time?

top_twenty_countries <- gs_entries_by_country %>% 
  filter(top_20_perc) %>% 
  count(year, country, wt = num_first_rd) %>% 
  arrange(year, desc(n)) %>% 
  group_by(country) %>% 
  mutate(country_ave_n = mean(n)) %>% 
  ungroup()

country_ave_ns <- top_twenty_countries %>% 
  distinct(country, country_ave_n)
  
# look at how many time the countries appear in the top twenty percent
top_twenty_counts <- top_twenty_countries %>% 
  count(country) %>% 
  left_join(country_ave_ns) %>% 
  arrange(desc(n), desc(country_ave_n)) %>% 
  rename(years_top_20 = n)
  

# for ordering exploratory plot
levels <- top_twenty_counts$country
  
# ggplot(top_twenty_countries) +
#   geom_tile(aes(x = year, 
#                 y = factor(country, levels = rev(levels)),
#                 fill = n),
#             colour = "#F8F7F7") +
#   labs(x= NULL, y = NULL) +
#   
#   coord_equal() +
#   
#   scale_fill_gradient(low = "#D6DCE0", high = "#000DA8", trans = "log") +
#   theme_minimal() +
#   theme(legend.position = "none")


# create a grid to see where countries have moved out of top 20 perc
grid <-  expand_grid(year = unique(top_twenty_countries$year),
                     country = unique(top_twenty_countries$country))
plot_df <- grid %>% 
  left_join(top_twenty_countries)

# deciles by year
bottom_80_countries_by_year <- gs_entries_by_country %>% 
  distinct(year, country, income_decile) %>% 
  filter(income_decile < 9)

check_bottom_80 <- function(year, country){

  selector <- bottom_80_countries_by_year$year == year & 
              bottom_80_countries_by_year$country == country
  
  res <- bottom_80_countries_by_year[selector, ]
  
  if(nrow(res) == 0){
    return(FALSE)
    }
  else {
    return(res[[1,"income_decile"]] < 9)
  }
}

check_bottom_80(1990, "Nigeria")
## [1] TRUE
plot_df_1 <- plot_df %>% 
  mutate(in_bottom_80 = map2_lgl(year, country, ~check_bottom_80(.x,.y)),
         n = if_else(is.na(n) & in_bottom_80,
                     -1, n),
         n = replace_na(n, 0),
         bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 10, 50, 100, Inf)))  


p <-   ggplot(plot_df_1) +
    geom_tile(aes(x = year, 
                  y = factor(country, levels = rev(levels)),
                  fill = bin_n),
              colour = "#F8F7F7") +
    labs(x= NULL, y = NULL) +
    
    coord_equal() +
    guides(fill = guide_legend(reverse=TRUE)) +
    
    
      # https://gka.github.io/palettes/#/5|s|ffffff,35469d|ffffe0,ff005e,93003a|1|1
  scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', '#9f9dce', '#6e70b6', '#35469d')) +
    
    #scale_fill_gradient(low = "#D6DCE0", high = "#000DA8") +
    scale_x_continuous(position = "top") +
    theme_minimal()
  
p

ggsave("../images/image_4.svg")

# a quick test
gs_entries_by_country %>% 
  filter(country == "Israel") %>% 
  distinct(year, income_decile)
## # A tibble: 29 × 2
##     year income_decile
##    <dbl>         <dbl>
##  1  1990             8
##  2  1991             8
##  3  1992             8
##  4  1993             8
##  5  1994             8
##  6  1995             8
##  7  1996             8
##  8  1997             8
##  9  1998             8
## 10  1999             8
## # … with 19 more rows

downward trend

gs_entries_by_country_clean <- gs_entries_by_country %>% 
  mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc))

all_gs_entries_by_country <- gs_entries_by_country_clean %>% 
  
  group_by(year, country) %>% 
  summarise(num_first_rd_year = sum(num_first_rd)) %>% 
  ungroup() %>% 
  
  left_join(distinct(gs_entries_by_country_clean, country, year, top_20_perc)) %>%
  
  # select(-c(tourney_name, num_first_rd)) %>%
  # distinct() %>%

  group_by(year) %>%
  mutate(perc_first_round = num_first_rd_year / sum(num_first_rd_year)) %>%
  ungroup()

all_gs_entries_by_country %>% 
  group_by(year, top_20_perc) %>% 
  summarise(perc_first_round = sum(perc_first_round))
## # A tibble: 60 × 3
## # Groups:   year [30]
##     year top_20_perc perc_first_round
##    <dbl> <lgl>                  <dbl>
##  1  1990 FALSE                  0.218
##  2  1990 TRUE                   0.782
##  3  1991 FALSE                  0.222
##  4  1991 TRUE                   0.778
##  5  1992 FALSE                  0.228
##  6  1992 TRUE                   0.772
##  7  1993 FALSE                  0.240
##  8  1993 TRUE                   0.760
##  9  1994 FALSE                  0.273
## 10  1994 TRUE                   0.727
## # … with 50 more rows
gs_entries_top_20_perc <- all_gs_entries_by_country %>% 
  filter(top_20_perc)

grid <- expand_grid(year = unique(gs_entries_top_20_perc$year),
                    country = unique(gs_entries_top_20_perc$country))
  
plot_df <- grid %>% 
  left_join(gs_entries_top_20_perc) %>% 
  mutate(perc_first_round = replace_na(perc_first_round, 0)) %>% 
  filter(year <= 2008) %>% 
  mutate(is_usa = country == "United States")
  
change_df <- plot_df %>% 
  filter(year == max(year) | year == min(year)) %>% 
  select(year, country, perc_first_round) %>% 
  
  pivot_wider(names_from = year,
              values_from = perc_first_round,
              values_fill = 0) %>% 
  mutate(change = `2008` - `1990`,
         fall = change < 0,
         change_bin = cut(change, breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>% 
  arrange(change)

plot_df_1 <- plot_df %>% 
  left_join(change_df) %>% 
  group_by(country) %>% 
  mutate(ave_num_first_round = mean(num_first_rd_year, na.rm = TRUE)) %>% 
  ungroup() %>% 
  # mutate(country = fct_reorder(country, ave_num_first_round, change,
  #                              .fun = "median"))
  arrange(fall, desc(change)) %>% 
  mutate(country = factor(country, levels = unique(country)),
         )


#visdat::vis_miss(plot_df, warn_large_data = FALSE)

p <- ggplot(plot_df_1) +
  geom_area(aes(year, perc_first_round,
                group = country,
                fill = change_bin),
            colour = "grey80", size = 0.2) +
  #scale_fill_gradient2(low = "#C94A54", mid = "white", high = "#35469D") #+
  # reds from https://gka.github.io/palettes/#/8|s|c94a54,fffff0|ffffe0,ff005e,93003a|1|1
  # blue from https://gka.github.io/palettes/#/8|s|35469d,fffff0|ffffe0,ff005e,93003a|1|1
    scale_fill_manual(values = c("#c94a54", "#efb3aa", "#fffff0", "#aeabcd")) +
  
    scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
                       breaks = c(seq(0,0.8,0.1)),
                       expand = expansion(mult = c(0, .1))) +
  
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2008)) +
  
    # coord_cartesian(clip = 'off') +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none",
        plot.margin = margin(25,25,25,25)) 
  #facet_wrap(~is_usa)

p

ggsave("../images/image_5.svg", units = "mm", width = 525, height = 350)

plotly::ggplotly(p)
levels(plot_df_1$change_bin)
## [1] "(-Inf,-0.1]"  "(-0.1,-0.02]" "(-0.02,0.02]" "(0.02,0.1]"   "(0.1, Inf]"
# get data points for plot annotation
plot_df_1 %>% 
  filter(country %in% c("United States", "Australia", "Sweden")) %>% 
  group_by(year) %>% 
  summarise(perc_first_round_tot = sum(perc_first_round))
## # A tibble: 19 × 2
##     year perc_first_round_tot
##    <dbl>                <dbl>
##  1  1990                0.395
##  2  1991                0.351
##  3  1992                0.326
##  4  1993                0.332
##  5  1994                0.306
##  6  1995                0.285
##  7  1996                0.264
##  8  1997                0.236
##  9  1998                0.233
## 10  1999                0.220
## 11  2000                0.217
## 12  2001                0.195
## 13  2002                0.184
## 14  2003                0.196
## 15  2004                0.179
## 16  2005                0.166
## 17  2006                0.151
## 18  2007                0.156
## 19  2008                0.139
plot_df_1 %>% 
  filter(country == "United States") %>% 
  group_by(year) %>% 
  summarise(perc_first_round_tot = sum(perc_first_round))
## # A tibble: 19 × 2
##     year perc_first_round_tot
##    <dbl>                <dbl>
##  1  1990               0.253 
##  2  1991               0.226 
##  3  1992               0.218 
##  4  1993               0.216 
##  5  1994               0.192 
##  6  1995               0.183 
##  7  1996               0.151 
##  8  1997               0.140 
##  9  1998               0.134 
## 10  1999               0.133 
## 11  2000               0.128 
## 12  2001               0.118 
## 13  2002               0.122 
## 14  2003               0.132 
## 15  2004               0.120 
## 16  2005               0.109 
## 17  2006               0.100 
## 18  2007               0.0994
## 19  2008               0.0907
# get players for annotation

get_players <- function(country_str, year_int){
  gs_first_round_gdp %>% 
  filter(country == country_str & year == year_int) %>% 
  distinct(name)
}

get_players("United States", 1990)
## # A tibble: 106 × 1
##    name             
##    <chr>            
##  1 Jim Pugh         
##  2 Ivan Lendl       
##  3 Tim Wilkison     
##  4 Todd Witsken     
##  5 Glenn Layendecker
##  6 Jimmy Brown      
##  7 John McEnroe     
##  8 Dan Goldie       
##  9 Leif Shiras      
## 10 Richey Reneberg  
## # … with 96 more rows
get_players("United States", 2008)
## # A tibble: 41 × 1
##    name            
##    <chr>           
##  1 Sam Querrey     
##  2 Vincent Spadea  
##  3 Donald Young    
##  4 Robby Ginepri   
##  5 Scoville Jenkins
##  6 Mardy Fish      
##  7 James Blake     
##  8 Bobby Reynolds  
##  9 Wayne Odesnik   
## 10 John Isner      
## # … with 31 more rows

upward trend

plot_df <- grid %>% 
  left_join(gs_entries_top_20_perc) %>% 
  mutate(perc_first_round = replace_na(perc_first_round, 0)) %>% 
  filter(year > 2008)

change_df <- plot_df %>% 
  filter(year == max(year) | year == min(year)) %>% 
  select(year, country, perc_first_round) %>% 
  
  pivot_wider(names_from = year,
              values_from = perc_first_round,
              values_fill = 0) %>% 
  mutate(change = `2019` - `2009`,
         fall = change < 0,
         change_bin = cut(change, breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>% 
  arrange(change)

countries_of_int <-  c("United States", "France", "Sweden", "Australia", "Spain")

plot_df_1 <- plot_df %>% 
  left_join(change_df) %>% 
  mutate(country = if_else(country %in% countries_of_int, country, "Other")) %>% 
  group_by(country, year) %>%
  summarise(perc_first_round = sum(perc_first_round)) %>% 
  ungroup()
  
facet_order <- c("United States", "Australia", "Sweden", "France", "Spain", "Other")

# annotation df
annotation_df <- plot_df_1 %>% 
  mutate(label = round(perc_first_round * 100, 1),
         num_appearances = round(perc_first_round * 256)) %>% 
  filter(year == max(plot_df_1$year)|
           year == min(plot_df_1$year))

p <- ggplot(plot_df_1,
            aes(year, perc_first_round)) +
  geom_area(aes(group = country),
            colour = "grey80", size = 0.2) +
  ggrepel::geom_text_repel(data = annotation_df,
                           mapping = aes(label = num_appearances)) +
  
  facet_wrap(~factor(country, levels = facet_order))

p

country affects

gs_first_round_gdp %>% 

  count(year, country, iso) %>% 
  group_by(year) %>% 
  mutate(perc_appear = n / sum(n)) %>% 
  
  filter(iso %in% c("USA", "FRA", "ESP")) %>% 
  
  ggplot() +
  geom_line(aes(year, perc_appear, colour = country))

country_group_counts <- gs_first_round_gdp %>% 
  count(year, country, iso) %>% 
  mutate(colour = if_else(
    iso %in% c("USA", "FRA", "ESP"), iso, "other"
  )) 

country_group_counts %>% 
  
  ggplot() +
  geom_line(aes(year, n, group = country, colour = colour)) +
      scale_y_continuous(expand = c(0,0)) +
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none",
        plot.margin = margin(25,25,25,25),
        panel.grid.minor = element_blank()) 

ggsave("../images/image_7.svg", units = "mm", width = 525, height = 350)

country_group_counts %>% 
  filter(iso == "USA")
## # A tibble: 30 × 5
##     year country       iso       n colour
##    <dbl> <chr>         <chr> <int> <chr> 
##  1  1990 United States USA     262 USA   
##  2  1991 United States USA     232 USA   
##  3  1992 United States USA     224 USA   
##  4  1993 United States USA     222 USA   
##  5  1994 United States USA     198 USA   
##  6  1995 United States USA     188 USA   
##  7  1996 United States USA     156 USA   
##  8  1997 United States USA     144 USA   
##  9  1998 United States USA     139 USA   
## 10  1999 United States USA     138 USA   
## # … with 20 more rows
top_3 <- gs_first_round_gdp %>% 
  count(year, country) %>% 
  arrange(year, desc(n)) %>% 
  group_by(year) %>% 
  mutate(rank = rank(-n)) %>% 
  ungroup() %>% 
  
  filter(rank == 1 | rank == 2 | rank == 3)

Countries outside top twenty percent

gs_entries_the_other_80 <- gs_entries_by_country %>% 
  filter(!top_20_perc) %>% 
  count(year, country, wt = num_first_rd)

grid <- expand_grid(country = unique(gs_entries_the_other_80$country),
                    year = unique(gs_entries_the_other_80$year))

plot_df <- grid %>%
  left_join(gs_entries_the_other_80) %>%
  mutate(n = replace_na(n, 0))

p <- ggplot(plot_df,
       aes(year, n, group = country)) +
  geom_line()

p

plotly::ggplotly(p)
p <- ggplot(plot_df,
       aes(year, n, fill = country)) +
  geom_area()

p

plotly::ggplotly(p)
gs_entries_by_country %>% 
  filter(str_detect(str_to_lower(country), "cz"))
## # A tibble: 120 × 8
##     year tourney_name    country_code country gdp_per_…¹ num_f…² incom…³ top_2…⁴
##    <dbl> <chr>           <chr>        <chr>        <dbl>   <dbl>   <dbl> <lgl>  
##  1  1990 Australian Open CZE          Czechia     23585.      10       8 FALSE  
##  2  1990 Roland Garros   CZE          Czechia     23585.      11       8 FALSE  
##  3  1990 US Open         CZE          Czechia     23585.      10       8 FALSE  
##  4  1990 Wimbledon       CZE          Czechia     23585.      10       8 FALSE  
##  5  1991 Australian Open CZE          Czechia     20896.      11       8 FALSE  
##  6  1991 Roland Garros   CZE          Czechia     20896.      10       8 FALSE  
##  7  1991 US Open         CZE          Czechia     20896.       8       8 FALSE  
##  8  1991 Wimbledon       CZE          Czechia     20896.       9       8 FALSE  
##  9  1992 Australian Open CZE          Czechia     20769.       9       8 FALSE  
## 10  1992 Roland Garros   CZE          Czechia     20769.       7       8 FALSE  
## # … with 110 more rows, and abbreviated variable names ¹​gdp_per_capita,
## #   ²​num_first_rd, ³​income_decile, ⁴​top_20_perc
countries_of_int <- c("Russia", "Argentina", "Czechia")

plot_df %>% 
  
  mutate(colour = if_else(country %in% countries_of_int,
                          country, "other")) %>% 
  
  filter(!(country == "Czechia" & year >= 2017)) %>% 
  
  ggplot(aes(year, n, 
             colour = colour, 
             group = country)) +
    geom_line() +
  
    scale_colour_manual(values = c("#A7BCD6", "#35469D", "grey95", "#C94A54")) +
          scale_y_continuous(expand = c(0,0)) +
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none",
        plot.margin = margin(25,25,25,25),
        panel.grid.minor = element_blank()) 

ggsave("../images/image_8.svg", units = "mm", width = 525, height = 350)

Bottom 50 percent

bottom_50_perc <- gs_entries_by_country %>% 
  filter(income_decile <= 5)

# num countries with first round appearances 
bottom_50_perc %>% 
  distinct(year, country) %>% 
  count(year) %>% 
  ggplot() +
    geom_col(aes(year, n), width = 0.8) +
            scale_y_continuous(expand = expansion(mult = c(0, .1))) +
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none",
        plot.margin = margin(25,25,25,25),
        panel.grid.minor = element_blank()) 

ggsave("../images/image_9.svg", units = "mm", width = 525, height = 350)

# for annotation
bottom_50_perc %>% 
  distinct(year, country) %>% 
  filter(year == 1990 | year == 2013)
## # A tibble: 7 × 2
##    year country   
##   <dbl> <chr>     
## 1  1990 India     
## 2  1990 Nigeria   
## 3  1990 Peru      
## 4  2013 China     
## 5  2013 Uzbekistan
## 6  2013 India     
## 7  2013 Georgia
# countries from bottom 50 percent with most appearances in first round 1990 - 2019
top_n <- 5

top_n_countries <- bottom_50_perc %>% 
  count(country) %>% 
  slice_max(order_by = n, n = top_n) %>% 
  .$country

bottom_50_perc %>% 
  distinct(year,country) %>% 
  filter(year == max(year)) %>% 
  .$country
## [1] "Ukraine"      "South Africa" "Tunisia"      "India"        "Uzbekistan"  
## [6] "Moldova"      "Bolivia"
plot_df <- bottom_50_perc %>% 
  count(year, country) %>% 
  mutate(country = if_else(country %in% top_n_countries,
                           country,
                           "Other")) %>% 
  group_by(year, country) %>% 
  summarise(n = sum(n))

ggplot(plot_df) +
  geom_col(aes(year, n, fill = country))

# for story text
bottom_50_perc %>% 
  distinct(year, country) %>% 
  filter(year == 2003)
## # A tibble: 17 × 2
##     year country               
##    <dbl> <chr>                 
##  1  2003 Belarus               
##  2  2003 Morocco               
##  3  2003 Indonesia             
##  4  2003 Philippines           
##  5  2003 Peru                  
##  6  2003 Armenia               
##  7  2003 Ecuador               
##  8  2003 Georgia               
##  9  2003 Paraguay              
## 10  2003 Uzbekistan            
## 11  2003 Zimbabwe              
## 12  2003 Colombia              
## 13  2003 Madagascar            
## 14  2003 Ukraine               
## 15  2003 Bosnia and Herzegovina
## 16  2003 China                 
## 17  2003 Tunisia
bottom_50_perc_country_counts <- bottom_50_perc %>% 
  count(year, country, wt = num_first_rd) %>% 
  group_by(country) %>% 
  mutate(country_ave_n = mean(n, na.rm = TRUE)) %>% 
  ungroup()


# look at how many times the countries appear in the bottom fifty percent
bottom_50_summary <- bottom_50_perc_country_counts %>% 
  count(country) %>% 
  rename(total_n = n) %>% 
  left_join(distinct(
              bottom_50_perc_country_counts,
              country,
              country_ave_n
              )) %>% 
  arrange(desc(country_ave_n))

# bottom_50_perc_country_counts <- bottom_50_perc_country_counts %>% 
#   left_join(bottom_50_summary)
  
# create a grid to see where countries have moved out of bottom 50 perc
grid <-  expand_grid(year = unique(bottom_50_perc_country_counts$year),
                     country = unique(bottom_50_perc_country_counts$country))
plot_df <- grid %>% 
  left_join(bottom_50_perc_country_counts)

# deciles by year
top_50_countries_by_year <- gs_entries_by_country %>% 
  distinct(year, country, income_decile) %>% 
  filter(income_decile > 5)

check_top_50 <- function(year, country){

  selector <- top_50_countries_by_year$year == year & 
              top_50_countries_by_year$country == country
  
  res <- top_50_countries_by_year[selector, ]
  
  if(nrow(res) == 0){
    return(FALSE)
    }
  else {
    return(res[[1,"income_decile"]] > 5)
  }
}

check_top_50(1991, "United States")
## [1] TRUE
plot_df_1 <- plot_df %>% 
  mutate(in_top_50 = map2_lgl(year, country, ~check_top_50(.x,.y)),
         n = if_else(is.na(n) & in_top_50,
                     -1, n),
         n = replace_na(n, 0),
         bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 5, 10, 20, Inf)))


# for ordering exploratory plot
levels <- rev(bottom_50_summary$country)

ggplot(plot_df_1) +
  geom_tile(aes(x = year, 
                y = factor(country, levels = levels),
                fill = bin_n),
            colour = "#E7E4E5") +
  labs(x= NULL, y = NULL) +
  
  coord_equal() +
  
  # https://gka.github.io/palettes/#/5|s|ffffff,35469d|ffffe0,ff005e,93003a|1|1
  scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', '#9f9dce', '#6e70b6', '#35469d')) +
  
  #scale_fill_gradient(low = "#D6DCE0", high = "#000DA8") +
  theme_minimal() +
  theme()

ggsave("test_out.svg")
library(ggridges)

bottom_50_countries <- bottom_50_summary %>% 
  slice_head(n = 10) %>% 
  .$country

gs_first_round_gdp %>% 
  filter(country %in% bottom_50_countries) %>% 

  ggplot() +
    ggridges::geom_density_ridges(mapping = aes(year, 
                                      factor(country, levels = levels),
                                      height = stat(density)),
                                  stat = "density")